Abstract: We propose a new method of perturbing a major variable by
adding noise such that results of regression analysis are unaffected. The
extent of the perturbation can be controlled using a single parameter, which
eases an actual perturbation application. On the basis of results of a
numerical experiment, we recommend an appropriate value of the parameter
that can achieve both sufficient perturbation to mask original values and
sufficient coherence between perturbed and original data.


1. Introduction 

Increasing amounts of information are now circulated because of recent
advancements in digitalization, thereby increasing the importance of protecting personal
information. Information that can identify a person should not be publicized or 
utilized without the person’s consent. In the case of information regarding real 
estate, the location can be identified by combining several information sources, 
which in turn might be used to identify a person, such as an owner or resident. 
Spatial information of the sort that relates to real estate can be considered to 
require special protection. 

Two factors are important in dealing with privacy-sensitive information.
First, if information is leaked, then the organization responsible for the
information risks receiving compensation claims because of privacy protection failure.
Second, to avoid trouble from potential information leaks, publicized data tend
to become very rough or vague, often hindering the usefulness of real estate
analyses aimed at understanding the market.

A promising way of dealing with this situation is to protect personal
information by adding noise to acquired data. A typical example of sensitive
information is transaction data, which can include transacted prices,
characteristics of the real estate or of the transacting persons, and information
regarding transaction conditions. Publicized data tend to omit information about
characteristics of transacting persons, and hence, such contents are assumed not
to be included in the database. In this case, one of the most sensitive types of
data will be the transacted price. Private information will be protected if noise
is added to the price data. However, careless noise addition seriously distorts
data analysis results. Therefore, providing a method of adding noise without
distorting data analyses but still protecting privacy is very important. This
study is devoted to proposing and applying such a method, assuming that the main
concern of the analyses is hedonic analysis, i.e., regression analysis with the
transacted price being the response variable.

Takemura (2003) reviewed statistical issues in publicizing individual data. 
He listed several methods of protecting personal information, such as (1) direct 
hiding by making the information secret, (2) global categorization by organizing 
values into several coarse classes, and (3) disturbance by replacing actual values 
with different ones (such as swapping by exchanging individual values, the
post-randomization method [PRAM], or the addition of noise). Direct hiding and
global categorization are not appropriate for releasing data for detailed analyses 
because the resolution of the information can become very coarse. Disturbance 
methods are superior in this aspect, although they usually introduce errors into 
analyses, and such effects must be carefully examined. 

One well-known approach to protecting personal information is statistical
disclosure limitation (SDL). SDL is a general term for methods that protect
personal data against identification through perturbation, modification,
or summarization (Shlomo (2010)). The main concern is to reduce identification
risk while retaining data usability.

Three kinds of methods are typically used to reduce identification risk:
(1) methods of establishing coarse categorization, (2) methods of generating
new data with statistical characteristics similar to those of the original
data, and (3) methods of adding noise to the original data (Karr et al. (2006);
Oganian and Karr (2011)).

Substantial research has been conducted on methods of establishing coarse 
categorization. In particular, population uniqueness, the feature that a
combination of attributes becomes unique in the parent population, has been studied
extensively. For example, Manrique-Vallier and Reiter (2012) estimated the risk 
of population uniqueness for discrete data. 

Regarding methods of generating new data, the swapping method, in which
categorical data are probabilistically exchanged, is well known. One such
swapping method is PRAM, which perturbs categorical data by exchanging
categories at random (Gouweleeuw et al. (1998); Willenborg and Waal (2001)).
In this method, a transition probability matrix is constructed and then used as
the basis for exchanging categorical data, while maintaining the original
proportions of the categories.

A variety of methods of adding noise, while carefully maintaining qualitative
features, have been proposed. For example, Oganian and Karr (2011) focused
on features such as the positivities of values and the magnitude relations
between pairs of values. They proposed a method of adding noise such that the
positivities of values, mean values, and variance-covariance matrices remain the
same. One remarkable idea is to use multiplicative noise addition to avoid
obtaining negative values. Moreover, they demonstrated the stability of results
after regression analyses. A similar method of maintaining the characteristics
of attributes was proposed by Abowd and Woodcock (2001). Another method
of adding noise to avoid the risk of identification is to introduce random noise
distributed following a peculiar symmetric distribution with a hole in the center.
With this method, the perturbed value is never close to the original value, and
therefore, the risk of identification is drastically reduced. In the actual
application of this method, the noise distribution is not publicized, hindering
analyses using the distribution (Reiter (2012)).

In general, noise addition can influence the quality of subsequent analyses.
Fuller (1993) noted that noise addition has an influence similar to that of
introducing measurement errors to explanatory variables. Several methods have
been devised to minimize the influence of noise in particular analyses. For
example, some methods maintain the original mean values and variance-covariance
matrices (Ting, Fienberg and Trottini (2008); Shlomo and De Waal (2008)). In
our paper, which focuses on regression analysis, a method is proposed in which
adding noise produces robust results.

The paper is organized as follows. In Section 2, we propose a method of 
adding noise to a response variable and show that some important statistics 
do not change with noise addition. In Section 3, numerical experiments are 
conducted to examine how the results of multivariate analyses, apart from the 
assumed regression analysis, can change. Finally, Section 4 concludes with a 
summary and suggests possible extensions of our method. 

2. Theoretical results 

We assume that the $n \times (p+1)$ design matrix $X$ is given by $(1_n, x_1, \dots, x_p)$,
where $1_n$ is an $n$-dimensional vector of ones, and the $n$-dimensional response
vector is $y$. We also assume that $n$ is sufficiently larger than $p$ and that the rank
of $X$ is $p+1$. Then the ordinary least squares (OLS) estimator is

$$\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p)' = (X'X)^{-1}X'y.$$

A decomposition of $y$ based on the OLS estimator $\hat{\beta}$ is $y = \hat{y} + e$, where

$$\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y$$

is the predictive vector and

$$e = y - \hat{y} = (I_n - X(X'X)^{-1}X')y$$

is the residual vector. Then the coefficient of determination defined by

$$R^2 = 1 - \frac{\|e\|^2}{\|y - \bar{y}1_n\|^2}, \tag{2.1}$$

where $\bar{y}$ is the sample mean of $y$, measures the goodness of fit resulting from
the use of the OLS estimator $\hat{\beta}$. The coefficient of determination, $R^2$, is hence
regarded as a key quantity in regression analysis. The t-value of the regression
coefficient $\hat{\beta}_j$ for $j = 0, 1, \dots, p$ is another key quantity and is defined by

$$t_j = \frac{\sqrt{n-p-1}\,\hat{\beta}_j}{\sqrt{d_j}\,\|e\|}, \tag{2.2}$$

where $d_j$ is the $(j+1)$-th diagonal component of $(X'X)^{-1}$. When Gaussian
linear regression is performed, $t_j$ has a Student's t-distribution with $n-p-1$
degrees of freedom under the null hypothesis $\beta_j = 0$.
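The quantities defined in (2.1) and (2.2) can be computed directly from their matrix expressions. The following is a minimal sketch, assuming NumPy is available; the function name `ols_summary` and the synthetic data are illustrative only and do not come from the paper.

```python
import numpy as np

def ols_summary(X, y):
    """Return OLS coefficients, residuals, R^2 and t-values.

    X is the n x (p+1) design matrix whose first column is all ones.
    """
    n, k = X.shape                         # k = p + 1
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y               # (X'X)^{-1} X'y
    resid = y - X @ beta                   # e = y - y_hat
    rss = resid @ resid                    # ||e||^2
    tss = np.sum((y - y.mean()) ** 2)      # ||y - ybar 1_n||^2
    r2 = 1.0 - rss / tss                   # coefficient of determination (2.1)
    d = np.diag(xtx_inv)                   # d_j, diagonal of (X'X)^{-1}
    t = np.sqrt(n - k) * beta / (np.sqrt(d) * np.sqrt(rss))  # t-values (2.2)
    return beta, resid, r2, t

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(n)

beta, resid, r2, t = ols_summary(X, y)
print(beta, r2, t)
```

The residuals returned here are orthogonal to every column of $X$, which is the property the noise construction that follows relies on.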

The objective of the derivation presented herein is to add perturbation to
the original response vector and achieve tractable tuning of the $R^2$ and t-values.
Any $n$-dimensional random vector

$$v = (v_1, \dots, v_n)'$$

may be used as the starting point. Since $n$ is sufficiently greater than $p$, with
probability one $v$ cannot be expressed as a linear combination of $e, 1_n, x_1, \dots, x_p$.
In other words,

$$u = \left(I_n - X(X'X)^{-1}X' - ee'/\|e\|^2\right)v \tag{2.3}$$

cannot be the zero vector. The noise vector considered in this paper is a linear
combination of $e$ and $u$, given by

$$\epsilon = \frac{a\|e\|}{1+b}\left\{\frac{e}{\|e\|} + \sqrt{b}\,\frac{u}{\|u\|}\right\}, \tag{2.4}$$

where $a \neq 0$ and $b \geq 0$. When $y + \epsilon$ is used instead of the original response
vector $y$, we have the following result.


Theorem 2.1. 1. The sample mean of $y + \epsilon$ is $\bar{y}$ for any $a$ and $b$.

2. The OLS estimator for the response vector $y + \epsilon$ remains the same for any
$a$ and $b$, that is,

$$(X'X)^{-1}X'(y + \epsilon) = (X'X)^{-1}X'y.$$

3. The t-values for the response vector $y + \epsilon$ are given by

$$\tilde{t}_j = \left\{\frac{1+b}{1+b+a(a+2)}\right\}^{1/2} t_j$$

for $j = 0, \dots, p$.

4. The coefficient of determination for the response vector $y + \epsilon$ is

$$\tilde{R}^2 = \frac{(1+b)R^2}{1+b+a(a+2)(1-R^2)}.$$

5. The correlation coefficient of $y$ and $y + \epsilon$ is

$$r_{y,y+\epsilon} = \frac{1+b+a(1-R^2)}{(1+b)^{1/2}\{1+b+a(a+2)(1-R^2)\}^{1/2}}.$$
Proof. By Part 3 of Lemma 2.1, we have $X'\epsilon = 0$, the first component of which
is $1_n'\epsilon = 0$. Hence Part 1 follows.

Since $X'\epsilon = 0$, we have

$$(X'X)^{-1}X'(y + \epsilon) = (X'X)^{-1}X'y + (X'X)^{-1}X'\epsilon = (X'X)^{-1}X'y, \tag{2.5}$$

which completes the proof of Part 2.

Note that the t-values are defined by (2.2). By (2.5), every component of the
OLS estimator remains the same. Further, $\sqrt{n-p-1}/\sqrt{d_j}$ does not depend on the
response vector. Hence Part 3 follows from Part 6 of Lemma 2.1.

Note that the coefficient of determination is defined by (2.1). Since the sample
mean of $y + \epsilon$ is also $\bar{y}$ by Part 1 of this theorem, the coefficient of
determination for the response vector $y + \epsilon$ is

$$1 - \frac{\|(I_n - X(X'X)^{-1}X')(y + \epsilon)\|^2}{\|y + \epsilon - \bar{y}1_n\|^2},$$

which is rewritten as

$$1 - \frac{\{1 + a(a+2)/(1+b)\}\|e\|^2}{\|y - \bar{y}1_n\|^2 + \{a(a+2)/(1+b)\}\|e\|^2}$$

by Parts 5 and 6 of Lemma 2.1. By the definition of $R^2$, we have

$$1 - R^2 = \|e\|^2/\|y - \bar{y}1_n\|^2, \tag{2.6}$$

which completes the proof of Part 4.

The correlation coefficient of $y$ and $y + \epsilon$ is

$$\frac{(y - \bar{y}1_n)'(y + \epsilon - \bar{y}1_n)}{\|y - \bar{y}1_n\|\,\|y + \epsilon - \bar{y}1_n\|}.$$

By Parts 3 and 4 of Lemma 2.1 as well as (2.6), we have

$$\begin{aligned}
(y - \bar{y}1_n)'(y + \epsilon - \bar{y}1_n) &= \|y - \bar{y}1_n\|^2 + (y - \bar{y}1_n)'\epsilon \\
&= \|y - \bar{y}1_n\|^2 + (\hat{y} + e - \bar{y}1_n)'\epsilon \\
&= \|y - \bar{y}1_n\|^2 + (X\hat{\beta} - \bar{y}1_n)'\epsilon + e'\epsilon \\
&= \|y - \bar{y}1_n\|^2 + e'\epsilon \\
&= \|y - \bar{y}1_n\|^2\left[1 + \{a/(1+b)\}(1 - R^2)\right].
\end{aligned}$$

Further, by Part 5 of Lemma 2.1, we have

$$\|y - \bar{y}1_n + \epsilon\|^2 = \|y - \bar{y}1_n\|^2\left[1 + \{a(a+2)/(1+b)\}(1 - R^2)\right],$$

which completes the proof of Part 5. □
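The construction in (2.3) and (2.4) and the invariances stated in Theorem 2.1 can be checked numerically. The sketch below assumes NumPy and draws $v$ from a standard normal distribution, which is only one admissible choice since the text allows any random vector; the names `make_noise` and `r_squared` and the synthetic data are illustrative, not part of the paper.

```python
import numpy as np

def make_noise(X, y, a=-2.0, b=1.0, rng=None):
    """Noise vector of (2.4), built from a random v via (2.3)."""
    rng = np.random.default_rng() if rng is None else rng
    H = X @ np.linalg.solve(X.T @ X, X.T)      # projection X (X'X)^{-1} X'
    e = y - H @ y                              # OLS residual vector
    v = rng.standard_normal(len(y))            # assumed distribution for v
    u = v - H @ v - e * (e @ v) / (e @ e)      # (2.3): remove span{X, e} from v
    e_norm = np.linalg.norm(e)
    return a * e_norm / (1.0 + b) * (e / e_norm + np.sqrt(b) * u / np.linalg.norm(u))

def r_squared(X, y):
    """Coefficient of determination (2.1) for the OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
n, p, b = 200, 3, 1.0
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.standard_normal(n)

y_pert = y + make_noise(X, y, a=-2.0, b=b, rng=rng)
r2, r2_pert = r_squared(X, y), r_squared(X, y_pert)
corr = np.corrcoef(y, y_pert)[0, 1]

print(y.mean() - y_pert.mean())                      # close to 0 (Part 1)
print(r2 - r2_pert)                                  # close to 0 when a = -2 (Part 4)
print(corr - (1.0 - 2.0 * (1.0 - r2) / (1.0 + b)))   # close to 0 (Part 5 with a = -2)
```

With $a = -2$ the factor $a(a+2)$ vanishes, which is why the perturbed $R^2$ coincides with the original one in this check.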

The lemma below summarizes fundamental properties related to $e$ and $\epsilon$,
which are needed in the proof of Theorem 2.1.

Lemma 2.1. 1. $e$ is orthogonal to $1_n, x_1, \dots, x_p$ or equivalently $X'e = 0$.

2. $u$ is orthogonal to $e, 1_n, x_1, \dots, x_p$ or equivalently $X'u = 0$ and $e'u = 0$.

3. $X'\epsilon = 0$.

4. $e'\epsilon = a\|e\|^2/(1+b)$ and $\|\epsilon\|^2 = a^2\|e\|^2/(1+b)$.

5. The sum of squared deviation of $y + \epsilon$ is

$$\|y + \epsilon - \bar{y}1_n\|^2 = \|y - \bar{y}1_n\|^2 + \{a(a+2)/(1+b)\}\|e\|^2.$$

6. The residual sum of squares for $y + \epsilon$ is

$$\|(I_n - X(X'X)^{-1}X')(y + \epsilon)\|^2 = \{1 + a(a+2)/(1+b)\}\|e\|^2.$$


Proof. Since $X'X(X'X)^{-1}X' = X'$, we have

$$\begin{aligned}
(1_n'e, x_1'e, \dots, x_p'e)' &= X'e \\
&= X'\left(I_n - X(X'X)^{-1}X'\right)y \\
&= \left(X' - X'X(X'X)^{-1}X'\right)y \\
&= 0,
\end{aligned}$$

which completes the proof of Part 1. In the same way, Part 2 can be proved.
Recall that $\epsilon$ is given by a linear combination of $e$ and $u$,

$$\epsilon = \frac{a\|e\|}{1+b}\left\{\frac{e}{\|e\|} + \sqrt{b}\,\frac{u}{\|u\|}\right\}. \tag{2.7}$$

Then Part 3 follows from Parts 1 and 2. Part 4 follows from the orthogonality
of $e$ and $u$ together with (2.7).

Since the sample mean of $y + \epsilon$ is $\bar{y}$ by Part 1 of Theorem 2.1, the sum of
squared deviation of $y + \epsilon$, $\|y + \epsilon - \bar{y}1_n\|^2$, is expanded as

$$\|y - \bar{y}1_n\|^2 + 2(y - \bar{y}1_n)'\epsilon + \|\epsilon\|^2.$$

By Part 3, we have

$$(y - \bar{y}1_n)'\epsilon = (\hat{y} + e - \bar{y}1_n)'\epsilon = (X\hat{\beta} + e - \bar{y}1_n)'\epsilon = e'\epsilon = a\|e\|^2/(1+b).$$

Then Part 5 follows from Part 4.

Since $X'\epsilon = 0$ by Part 3, we have

$$(I_n - X(X'X)^{-1}X')(y + \epsilon) = e + \epsilon.$$

From Part 4, the residual sum of squares is

$$\|e + \epsilon\|^2 = \|e\|^2 + 2e'\epsilon + \|\epsilon\|^2 = \|e\|^2 + 2a\|e\|^2/(1+b) + a^2\|e\|^2/(1+b),$$

which completes the proof of Part 6. □

By Theorem 2.1, we see that $a = -2$ is a special case, as follows.
Theorem 2.2. Assume $a = -2$. Then the following hold.

1. For any $b \geq 0$, the coefficient of determination for $y + \epsilon$ is equal to $R^2$,
the coefficient of determination for the original $y$.

2. For any $b \geq 0$, the t-value of $\hat{\beta}_j$ ($j = 0, 1, \dots, p$) for the response vector
$y + \epsilon$ is equal to $t_j$.

3. The correlation coefficient of $y$ and $y + \epsilon$ is

$$r_{y,y+\epsilon} = 1 - \frac{2(1-R^2)}{1+b}. \tag{2.8}$$


Recall that $\epsilon$ is a function of $v$, any random $n$-dimensional vector, through
the relationships (2.3) and (2.4), that is,

$$u = \left(I_n - X(X'X)^{-1}X' - \frac{ee'}{\|e\|^2}\right)v, \qquad
\epsilon = \frac{a\|e\|}{1+b}\left\{\frac{e}{\|e\|} + \sqrt{b}\,\frac{u}{\|u\|}\right\}.$$

In Parts 1 and 2 of Theorem 2.2, the choice $a = -2$ guarantees that the
coefficient of determination and t-values remain the same regardless of $v$.

By Part 3 of Theorem 2.2, $r_{y,y+\epsilon}$ increases with $b$ for fixed $R^2$. The correlation
coefficients between the original responses $y$ and perturbed responses $y + \epsilon$ with
$a = -2$, for varying $b \geq 0$ and $R^2$, are illustrated in Table 1.
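As a quick check of (2.8), the entries of Table 1 can be regenerated from the closed form. This is a small sketch in plain Python; the function name `corr_original_perturbed` is illustrative.

```python
def corr_original_perturbed(r2, b):
    """Correlation of y and y + eps for a = -2, from (2.8)."""
    return 1.0 - 2.0 * (1.0 - r2) / (1.0 + b)

# Grid used in Table 1: b = 0, 0.25, ..., 2.0 and R^2 = 0.4, 0.6, 0.8.
b_values = [0.25 * k for k in range(9)]
for r2 in (0.4, 0.6, 0.8):
    row = [round(corr_original_perturbed(r2, b), 2) for b in b_values]
    print(r2, row)
```

At $b = 1$ the expression reduces to $R^2$ itself, and at $b = 0$ with $R^2 < 0.5$ the correlation is negative.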

In actual application, it is desirable to have relatively high correlation,
because data users might assume that the perturbed response is close to the
original response. However, if the correlation is very high, then the perturbed
response is very close to the original response, and the objective of concealing
the actual response cannot be achieved. Thus, it is necessary to determine a
value of $b$ that prevents the perturbed response from being too close to the
actual response, as will be discussed through the analysis of real data in the next
section.

Remark 2.1. When $a = -2$ and $b = 0$, we have $\epsilon = -2e$ as the noise or,
equivalently,

$$y - 2e = \hat{y} - e \tag{2.9}$$

as the perturbed response. In this case, it is clear that the coefficient of
determination and t-values remain the same, since $y_i$ and $y_i - 2e_i$ for $i = 1, \dots, n$ are
symmetric with respect to the point $\hat{y}_i = y_i - e_i$. Since the noise $\epsilon = -2e$ does
not depend on $v$, there is no randomness in the noise. Theorem 2.2 ensures that,
for random $v$, as in (2.4), it is possible to construct the noise $\epsilon$ such that the
coefficient of determination and t-values remain the same.


Remark 2.2. As in Theorem 2.2, the choice $a = -2$ with random $v$ was
surprisingly found to retain the $R^2$ and t-values. Following are some remarks on the
other choices. For $a \in (-\infty, -2) \cup (0, \infty)$, both $R^2$ and the absolute values of
the t-values are reduced. For example, $b \geq 0$ and
$a = -1 \pm \sqrt{b+2} \in (-\infty, -2) \cup (0, \infty)$, for which $a(a+2) = (a+1)^2 - 1 = b + 1$,
yield

$$\tilde{t}_j = \frac{t_j}{\sqrt{2}}, \qquad \tilde{R}^2 = \frac{R^2}{2 - R^2} < R^2. \tag{2.10}$$


Y. Maruyama et al./Method for Noise Addition 


Table 1
Correlation coefficient of y and y + ε with a = -2

R^2 \ b      0   0.25    0.5   0.75    1.0   1.25    1.5   1.75    2.0
0.4       -0.2   0.04    0.2   0.31    0.4   0.47   0.52   0.56    0.6
0.6        0.2   0.36   0.47   0.54    0.6   0.64   0.68   0.71   0.73
0.8        0.6   0.68   0.73   0.77    0.8   0.82   0.84   0.85   0.87


Note that $R^2$ and t-values can be completely controlled. Thus the data provider
can safely provide data together with the relation between $\{t_j, R^2\}$ and
$\{\tilde{t}_j, \tilde{R}^2\}$ described by (2.10), and practitioners can restore the original
$R^2$ and t-values independently. An efficient method of releasing data with
reduced accuracy will be reported elsewhere.
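The restoration mentioned in Remark 2.2 amounts to inverting (2.10). A minimal sketch in plain Python follows; the function names are hypothetical, and the inverse $R^2 = 2\tilde{R}^2/(1+\tilde{R}^2)$ is obtained by solving $\tilde{R}^2 = R^2/(2-R^2)$ for $R^2$.

```python
import math

def perturbed_stats(t, r2):
    """Map original (t_j, R^2) to the values obtained under
    a = -1 +/- sqrt(b + 2), following (2.10)."""
    return t / math.sqrt(2.0), r2 / (2.0 - r2)

def restore_stats(t_tilde, r2_tilde):
    """Invert (2.10) to recover the original t-value and R^2."""
    return math.sqrt(2.0) * t_tilde, 2.0 * r2_tilde / (1.0 + r2_tilde)

# Illustrative values only.
t_p, r2_p = perturbed_stats(3.0, 0.7748)
t_0, r2_0 = restore_stats(t_p, r2_p)
print(t_p, r2_p)
print(t_0, r2_0)
```

The mapping does not depend on $b$, so a user who knows only that this family of $a$ was used can undo the shrinkage without learning $v$.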


3. Numerical experiment 

In the previous section, a method was proposed to add noise to the response 
variable. This method can be applied when a real estate database is released into 
the public domain, by adding noise to the transacted price, which is considered 
to be sensitive information in Japan. As Theorem 2.2 in the previous section 
ensures, the results of regression analysis using the perturbed data will not 
change. However, in actual application, a variety of analyses will be devised, 
and the theorems do not apply in cases with unexpected applications. Thus, it 
is necessary to verify whether the proposed method remains appropriate even 
in unexpected applications. 

The precision of the results might be degraded if analytical operations not 
assumed in the theory are applied. In such cases, permissible error levels
resulting from the perturbation must be determined. In the following, a numerical
experiment to determine the relationship between perturbation and precision 
level is discussed. 

3.1. Data used in the experiment 

The data source used for the numerical experiment was At Home Co. Ltd. The
data contained real estate advertisement information from 2008. The database
for the experiment was created by supplementing some spatial variables. It
contained 1,320 cases of newly built detached houses in Setagaya Ward in Tokyo
Prefecture.¹ The variables included the price of the property (yen), the time
to the nearest railway station (minutes), a dummy variable representing bus
usage, the area of the site (square meters), the floor area (square meters), a
dummy variable signifying leased land, the designated building coverage ratio,
the designated floor area ratio, the time to Shinjuku by rail from the nearest
station (minutes), the time to Shibuya by rail (minutes), the time to Yokohama
by rail (minutes), the time to Tokyo by rail (minutes), the width of the nearest
road (meters), and a dummy variable signifying the nearest road to the south of
the lot. Note that Shinjuku, Shibuya, Yokohama, and Tokyo are four major railway
stations in the study region. Among these variables, the times to the railway
stations, width of the nearest road, and dummy variable signifying the nearest
road to the south are spatial variables, as described in the next subsection.

¹ We selected data that contained information about the designated floor area and building
coverage ratios. Such data are thought to be important in real estate analysis in Japan. In
the original database, a new record was added each time a property owner changed the price
in the advertisement. In such situations, we selected only the newest record.

3.2. Creation of spatial variables 

The times to the major railway stations from the nearest station; the width of 
the nearest road, as measured from the representative point of the property; and 
the dummy variable signifying whether the nearest road is located to the south 
of the property were added to the original database as variables for signifying 
spatial relationships. The width of the nearest road to the representative point 
of the property was regarded as the width of the nearest road, which was done 
because precise digital data for lots are not available. Accordingly, the dummy 
variable signifying whether the nearest road was located to the south of the 
property was regarded as the dummy variable signifying the nearest road to the 
south of lot. 

The times to the major railway stations from the nearest station were
calculated using the railway transfer guidance search system provided by
NAVITIME Japan Co. Ltd. This system automatically calculates the time required
to travel to the major railway stations, i.e., Shinjuku, Shibuya, Yokohama, and
Tokyo, from the railway station nearest to the property. To determine the times
required in this study, the departure time was set to 12:00 (noon) on August
2, 2010.

The width of the nearest road from the representative point of the property 
was calculated as follows. Mapple 10000 digital data produced by Shobunsha 
Publications Inc. contain digital road data classified by road width categories, 
such as 4-5m and 5-6m. The median of each class was assigned as the road 
width. For example, a width of 4.5m was used for the 4-5m class. With the 
geographic information system (GIS) software ArcGIS 10, the nearest road was 
assigned for each property, and the width of the road calculated as described 
above was set to be the width of road nearest to the representative point of the 
property. 

In the real estate market in Japan, a residential lot tends to be evaluated
highly if it is adjacent to a road to the south of the lot, because receiving
substantial sunlight is preferred in Tokyo. For example, The Real Estate Transaction
Modernization Center (1986) treats properties adjacent to roads to the south of
lots more favorably in their property appraisals. With this preference in mind,
the dummy variable signifying whether the nearest road is located to the south
of the property was also added to the database.

This dummy variable was constructed as follows. Using ArcGIS 10, the
direction to the nearest road was calculated, such that 0° was located to the east, and
the value increased to 180° counterclockwise and decreased to -180° clockwise.
The range from -135° to -45° was judged to be to the south, in which case the
dummy variable was set to one, and it was set to zero otherwise. The statistics
of the variables are summarized in Table 2.

Table 2
Summary of variable statistics

variable                                                    min        max       mean       s.d.
price of the property (yen)                            34800000  330000000   72431491   25539447
time to the nearest railway station (minutes)                 0         25      10.60       4.83
d.v.(a) representing bus usage                                0          1       0.07       0.26
area of the site (square meters)                          29.53     211.49      88.56      25.48
floor area (square meters)                                47.07     228.48      98.94      20.06
d.v.(a) signifying leased land                                0          1       0.03       0.17
designated building coverage ratio                           40         80      54.18       7.70
designated floor area ratio                                  80        300     141.43      47.10
time to Shinjuku by rail (minutes)                            5         32      18.72       5.29
time to Shibuya by rail (minutes)                             3         29      14.86       6.01
time to Yokohama by rail (minutes)                           17         64      44.30      10.99
time to Tokyo by rail (minutes)                              23         48      34.09       4.90
width of the nearest road (meters)                          4.5         35       5.80       2.25
d.v.(a) signifying the nearest road to the south of lot       0          1       0.28       0.45

(a) d.v. stands for "dummy variable".
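The angular rule for the south-road dummy variable can be written compactly. The sketch below is in plain Python and is not the ArcGIS procedure used in the study: it assumes the direction is available as an east/north displacement from the representative point to the nearest road, and it treats the endpoints -135° and -45° as inclusive, which the text does not specify. The function name `south_road_dummy` is hypothetical.

```python
import math

def south_road_dummy(dx_east, dy_north):
    """Return 1 if the nearest road lies to the south of the property.

    The angle is measured from east (0 degrees), positive counterclockwise
    up to 180 and negative clockwise down to -180, as in the text.
    """
    angle = math.degrees(math.atan2(dy_north, dx_east))
    # Assumption: both endpoints of the south sector are included.
    return 1 if -135.0 <= angle <= -45.0 else 0

print(south_road_dummy(0.0, -1.0))   # road due south -> 1
print(south_road_dummy(1.0, 0.0))    # road due east  -> 0
```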


3.3. Numerical experiment with perturbed property price 

The perturbed property price, which was generated by adding noise to the
response variable using the method described in the previous section, was
numerically tested as described in this subsection. The explanatory variables used
were the 13 variables in Table 2.

3.3.1. Statistics of the perturbed property price 

Although Part 1 of Theorem 2.1 guarantees that the mean of the perturbed
response variable is exactly equal to the mean of the original response, the equality
or similarity of the other statistics, such as the minimum value, maximum value,
and first and third quantiles, theoretically cannot be controlled. In this
subsection, four sets of quasi-response variables are generated with different
$v$ values, $a = -2$, and $b = 1$, and the degrees of perturbation of the
statistics are analyzed among the five sets, including the original response
(original, quasi1, quasi2, quasi3, and quasi4).

Figure 1 shows boxplots of the five sets. When the original and quasi-response
variables are compared, the medians are very similar, but the quantiles, minima,
and maxima are quite different. It is also evident that, among the four sets of
quasi variables, all of the statistics are similar. Figure 2 shows scatterplots of
the original variables and of the four sets of quasi variables. Although the plots
for the four sets of quasi variables appear very similar, the different $v$ values
imply different quasi variables, as explained in Section 2. Table 3 provides the
correlation matrix for the five sets of response variables. By Part 3 of Theorem
2.2, the correlation coefficient between the original and quasi variables is given
theoretically by $1 - 2(1 - R^2)/(1 + b)$, which equals $R^2$ for $b = 1$. Among the
quasi variables, the correlations in all cases are approximately 0.78.

Fig 1. Boxplots of five sets of response variables (original, quasi1, quasi2, quasi3, quasi4)

Fig 2. Scatterplots of five sets of response variables (four panels, each with the original response on the horizontal axis)

Table 3
Correlations between original response and four sets of quasi responses

            orig  quasi1  quasi2  quasi3  quasi4
orig           1  0.7748  0.7748  0.7748  0.7748
quasi1    0.7748       1  0.7693  0.7749  0.7814
quasi2    0.7748  0.7693       1  0.7870  0.7812
quasi3    0.7748  0.7749  0.7870       1  0.7733
quasi4    0.7748  0.7814  0.7812  0.7733       1

Remark 3.1. In this particular data set, the response variable was the property
price, which was expected to be positive. Hence, a positive perturbed price is
strongly desirable. As claimed in Remark 2.1, for sufficiently small $b$, we have

$$y + \epsilon \approx \hat{y} - e.$$

Suppose there exist individuals $i$ with relatively expensive prices $y_i$, when
relatively lower prices $\hat{y}_i$ are expected. Then $e_i$ increases, and as a result

$$y_i + \epsilon_i \approx \hat{y}_i - e_i < 0 \tag{3.1}$$

can occur. In our data set, such situations rarely occurred for $b = 1.2$ or less
and never occurred for $b = 1.3$ or greater. To the best of our knowledge, the
occurrence of such situations is theoretically not controllable through the choices
of $b$ and $v$. When (3.1) occurs, it is recommended to generate $\epsilon$ with different
$v$ values until $\min_i\{y_i + \epsilon_i\} > 0$ is achieved.
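The redraw procedure recommended in Remark 3.1 can be sketched as a simple rejection loop. The code below assumes NumPy, fixes $a = -2$, and draws $v$ from a standard normal distribution; the function name `perturb_positive`, the `max_tries` cap, and the synthetic data are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def perturb_positive(X, y, b=1.0, max_tries=100, rng=None):
    """Redraw v until every perturbed response y_i + eps_i is positive (a = -2)."""
    rng = np.random.default_rng() if rng is None else rng
    H = X @ np.linalg.solve(X.T @ X, X.T)      # projection X (X'X)^{-1} X'
    e = y - H @ y                              # OLS residual vector
    e_norm = np.linalg.norm(e)
    for _ in range(max_tries):
        v = rng.standard_normal(len(y))        # assumed distribution for v
        u = v - H @ v - e * (e @ v) / (e @ e)  # (2.3)
        eps = -2.0 * e_norm / (1.0 + b) * (e / e_norm + np.sqrt(b) * u / np.linalg.norm(u))
        if np.min(y + eps) > 0:                # condition opposite to (3.1)
            return y + eps
    raise RuntimeError("no positive perturbed response found; consider a larger b")

# Synthetic positive "prices", purely for illustration.
rng = np.random.default_rng(1)
n = 150
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
y = 100.0 + 5.0 * x + 3.0 * rng.standard_normal(n)

y_pos = perturb_positive(X, y, b=1.0, rng=rng)
print(y_pos.min(), y_pos.mean() - y.mean())
```

Because every draw of $v$ satisfies Theorem 2.2, rejecting draws does not alter the regression results, although conditioning on positivity may change the distribution of the released noise in ways the text does not analyze.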

3.3.2. Regression analysis using only a portion of the database 

The theory assumes that all of the data will be used for the analysis. If only a 
portion of the perturbed data is used, then the theorems do not apply exactly. In 
actual analyses for real estate data, only a portion of the (perturbed) database 
is used for the analysis. In such cases, it is necessary to know how the results 
might differ from the theoretical results and to follow the subsequently described 
guidelines to choose an appropriate value of b. 

For this purpose, a critical value of b may be obtained such that the difference 
between the regression model using the perturbed property price as the response 
variable and the original property price is not statistically significant. 

3.3.3. Chow test 

From the 1,320 cases (total database), 20% (i.e., 264 cases) were selected randomly,
and perturbed prices were generated for 13 values of $b$ (i.e., $b$ = 0.5, 0.6, 0.7, 0.8,
0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 2.0, 2.5). The Chow test was applied to determine
whether the regression models with the original and the perturbed prices could
be regarded as the same model. For each value of $b$, 1,000 independently chosen
samples were created and analyzed. As a result, for each value of $b$, 1,000 values
of the Chow test F value were derived. Ordering these values by magnitude, the
5%, 10%, 50%, 90% and 95% points (i.e., the 50th, 100th, 500th, 900th and 950th
values) of the F value were derived. Figure 3 shows that larger values
of $b$ correspond to smaller F-value variations. Given the objective of choosing
an appropriate value of $b$ to generate a properly perturbed property price, the
minimum value of $b$ for which the null hypothesis of the Chow test (namely,
$H_0$: "There is no statistically significant difference between two models") is not
rejected can be considered the critical value of $b$. Note that the critical value of
the F distribution with degrees of freedom 14 and 500 at a significance
level of 0.05 is F = 1.71. Hence, if F is less than 1.71, then the null hypothesis
cannot be rejected, and therefore the two models can be regarded as statistically
the same.
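One way to compute such a Chow statistic is sketched below with NumPy. It assumes that the two models share the same design matrix for the selected cases and that the pooled model stacks the original and perturbed responses; with 264 cases and 14 coefficients this gives degrees of freedom 14 and 500, matching the values quoted above, but the exact procedure used in the study is not spelled out in the text. The names `rss` and `chow_f` and the synthetic data are illustrative.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

def chow_f(X, y_orig, y_pert):
    """Chow F statistic comparing the fits of y_orig and y_pert on the same X."""
    n, k = X.shape
    rss1, rss2 = rss(X, y_orig), rss(X, y_pert)
    # Pooled model: one coefficient vector for both response vectors.
    rss_pooled = rss(np.vstack([X, X]), np.concatenate([y_orig, y_pert]))
    return ((rss_pooled - rss1 - rss2) / k) / ((rss1 + rss2) / (2 * n - 2 * k))

# Synthetic data with 264 cases and 14 coefficients, purely for illustration.
rng = np.random.default_rng(0)
n, k = 264, 14
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = X @ np.ones(k) + rng.standard_normal(n)

f_same = chow_f(X, y, y)          # identical responses: statistic near zero
f_shift = chow_f(X, y, y + 5.0)   # shifted responses: statistic far above 1.71
print(f_same, f_shift)
```

In the experiment described above, `X` would hold the rows of the selected 20% sample and `y_pert` the corresponding perturbed prices; the resulting statistic would then be compared with the critical value 1.71.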
Fig 3. The relation between F value and b

For each value of $b$, the percentage of F values among 1,000 trials that
satisfied the acceptance condition of F less than 1.71 was calculated. In our
numerical experiment, these percentages are 65.0%, 97.0%, and 100% for $b = 0.5$,
$b = 1.0$, and $b \geq 1.4$, respectively, as seen in Table 4.

3.3.4. Recommended standard for b value

In the numerical experiment described above, when 20% of the cases were selected
randomly, the Chow tests used to test the identity of the two models, that is, the
regression models with the actual and perturbed property prices as the response
variables, demonstrated that the F value satisfied the acceptance condition with 97.0%
probability when $b = 1.0$ and 100% probability when $b \geq 1.4$. Assuming that
approximately 5% is the permissible level for hypothesis rejection (i.e., that two
models cannot be regarded as the same), $b = 1.0$ is judged appropriate, as it
ensures that the perturbed price is perturbed sufficiently and, nonetheless, that
the regression model with the perturbed price can be regarded as identical to
the regression model with the original price. The appropriate value of $b$ differs
if another percentage is used to select the sample. Let $q$ be the proportion of
cases used to select the sample; the above numerical experiment corresponds to
$q = 0.2$ (20%). Assuming a 5% rejection level, the critical value of $b$, denoted
$b^*$, such that for $b$ less than $b^*$ the probability of rejection becomes greater
than 5%, was calculated by changing $q$. Table 4 summarizes the results. For all $q$
values investigated in this study, $b^* = 1.0$ appears to be a reasonable choice, as
it balances the similarity and the perturbation to the original price.

Table 4
Percentages of samples that accepted the null hypothesis, for which the perturbed sample can
be regarded as statistically identical to the original sample, for sample selection percentage
q and b value

b \ q    0.05   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90
0.5     0.591  0.576  0.650  0.728  0.827  0.918  0.969  0.992  0.996  1.000
0.6     0.686  0.670  0.744  0.808  0.903  0.947  0.988  0.997  0.999  1.000
0.7     0.729  0.764  0.808  0.873  0.943  0.976  0.994  1.000  0.999  1.000
0.8     0.815  0.845  0.858  0.928  0.966  0.988  0.996  1.000  1.000  1.000
0.9     0.867  0.867  0.931  0.966  0.989  1.000  0.998  1.000  1.000  1.000
1.0     0.930  0.925  0.970  0.987  0.995  0.998  1.000  1.000  1.000  1.000
1.1     0.960  0.968  0.983  0.998  0.997  1.000  1.000  1.000  1.000  1.000
1.2     0.978  0.987  0.994  0.999  0.999  1.000  1.000  1.000  1.000  1.000
1.3     0.994  0.994  0.995  1.000  1.000  1.000  1.000  1.000  1.000  1.000
1.4     0.996  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
1.5     0.999  0.998  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
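The critical value $b^*$ can be read off a column of Table 4 mechanically. The sketch below, in plain Python, uses one possible reading of the definition in the text: $b^*$ is the smallest grid value from which the acceptance rate stays at or above 95%. The function name `critical_b` is hypothetical, and only the $q = 0.20$ column of Table 4 is used.

```python
def critical_b(b_values, acceptance, level=0.95):
    """Smallest b from which the acceptance rate stays at or above `level`."""
    b_star = None
    # Walk from the largest b downwards and stop at the first failure.
    for b, rate in sorted(zip(b_values, acceptance), reverse=True):
        if rate < level:
            break
        b_star = b
    return b_star

# Column q = 0.20 of Table 4.
b_grid = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
accept_q20 = [0.650, 0.744, 0.808, 0.858, 0.931, 0.970,
              0.983, 0.994, 0.995, 1.000, 1.000]

print(critical_b(b_grid, accept_q20))   # 1.0
```

Applying the same function to other columns shows how the threshold moves with $q$; the choice of 95% as the cut-off follows the 5% rejection level assumed in the text.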

4. Conclusion 

This paper proposed a new method of perturbing a major variable by adding 
noise, while ensuring that the results of regression analysis are not affected. The 
extent of the perturbation can be controlled using a single parameter, b, which 
eases actual perturbation application. Moreover, b = 1.0 can be regarded as an 
appropriate value for achieving both sufficient perturbation to mask the original 
values and sufficient coherence between the perturbed and original data. 

The proposed method masks only one major variable, but in actual application,
many situations may be encountered in which only one variable is critical
to put the entire dataset in the public domain. Our method will be useful in
such situations. There are other possible uses of perturbed data, and the
appropriateness of the $b$ value must be examined by testing a greater variety of
data-use cases. Admittedly, application of the proposed method is limited,
because other variables are assumed to retain their original values. Thus, further
methods of perturbing the explanatory variables are necessary to broaden the
range of applications. Such extensions will be provided in subsequent work.